```r
library(tidyverse)
library(visdat)
library(pointblank)
library(readxl)
library(skimr)
```

Data Validation in Excel and R
Overview
There are many opportunities for human or instrument error to affect data. Ideally, you want to find those errors and fix them early and often! This workshop introduces some tools in Excel and R to avoid making mistakes in data entry and data collection, and to detect the ones you inadvertently make.
Learning Objectives
Understand best practices for entering data and fixing errors in data
Use Excel data validation tools to prevent data entry errors
Compare data entered by two people in R to check for data entry mistakes
Explore data summaries to check for errors
Get the gist of how you can use the pointblank package to perform data validation checks
Software needed
You’ll need access to Excel, R, RStudio, and the following R packages:
If you can, install the development version of pointblank using this R code:
```r
if (!require("remotes")) {
  install.packages("remotes")
}
remotes::install_github("rich-iannone/pointblank")
```

Avoiding mistakes in data entry
Set up validation tools in your data entry spreadsheet to stop data entry errors in their tracks!
Data Validation Tools in Excel
Select a column (or cells) and choose Data > Validation… from the menu
- Use “list” to restrict entries to specific values for categorical data
- Use “whole number” for count data
- Validation can also be set up after data entry: highlight invalid data with “Circle Invalid Data” from the toolbar
Watch out for Excel autocorrect!
To stop Excel from converting entries to dates:
- Explicitly set all column types to numeric, text, date, etc.
- Make sure no columns are set to “general”
Double-entry Method
Two people enter the same data, then compare the two versions programmatically.
In the data folder, there are two versions of a dataset—one entered by Eric and one entered by Jessica.

```r
eric <- read_excel("data/data_eric.xlsx")
jessica <- read_excel("data/data_jessica.xlsx")
```

Compare visually with visdat
We can compare them a couple of ways. First, we can compare them visually using the visdat package. This only works if the two datasets are the same dimensions.
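Since the comparison only makes sense when the two versions line up row for row, you can fail early if their shapes differ. A minimal sketch, using toy stand-ins for the two data-entry versions:

```r
# Toy stand-ins for the two data-entry versions (values are made up)
a <- data.frame(plot = c("A", "B"), ht = c(27, 56))
b <- data.frame(plot = c("A", "B"), ht = c(21, 56))

# vis_compare() requires identical dimensions; check before calling it
stopifnot(identical(dim(a), dim(b)))
```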
```r
vis_compare(eric, jessica)
```

Compare with dplyr::anti_join()
First add row numbers to make it easier to find mistakes in Excel.
```r
# add row numbers that match Excel (headers are row 1)
eric <- eric |> mutate(row = 2:(n() + 1), .before = plot)
jessica <- jessica |> mutate(row = 2:(n() + 1), .before = plot)
```

anti_join() takes two data frames and, when joining by every column, returns the rows of the first that have no exact match in the second.
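A toy illustration of that behavior (made-up values, joining by all columns):

```r
library(dplyr)

x <- data.frame(id = 1:3, val = c("a", "b", "c"))
y <- data.frame(id = 1:3, val = c("a", "b", "z"))

# row 3 of `x` has no exact match in `y`, so only it is returned
anti_join(x, y, by = c("id", "val"))
```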
```r
# values in `eric` that are different in `jessica`
anti_join(eric, jessica)
```

```
Joining, by = c("row", "plot", "plant_id", "shts_2000", "ht_2000", "flwr_2000",
  "shts_2001", "ht_2001", "flwr_2001")
# A tibble: 7 × 9
    row plot  plant_id shts_2000 ht_2000 flwr_2000 shts_2001 ht_2001 flwr_2001
  <int> <chr>    <dbl>     <dbl>   <dbl> <chr>         <dbl>   <dbl> <chr>
1    14 A           33         1      27 <NA>              2      23 <NA>
2    96 B          142         3      56 yse               4      52 <NA>
3   121 B          162         2      19 <NA>              3      15 <NA>
4   159 C          226         2      38 <NA>              1      32 <NA>
5   167 C          235         1       7 <NA>              1       8 <NA>
6   168 C          236         1      12 <NA>              1      13 <NA>
7   225 E          306         1      13 <NA>             NA      NA <NA>
```
```r
# values in `jessica` that are different in `eric`
anti_join(jessica, eric)
```

```
Joining, by = c("row", "plot", "plant_id", "shts_2000", "ht_2000", "flwr_2000",
  "shts_2001", "ht_2001", "flwr_2001")
# A tibble: 7 × 9
    row plot  plant_id shts_2000 ht_2000 flwr_2000 shts_2001 ht_2001 flwr_2001
  <int> <chr>    <dbl>     <dbl>   <dbl> <chr>         <dbl>   <dbl> <chr>
1    14 A           33         1      21 <NA>              2      23 <NA>
2    96 B          142         3      56 yes               4      52 <NA>
3   121 C          162         2      19 <NA>              3      15 <NA>
4   159 C          226         2      33 <NA>              1      32 <NA>
5   167 C          233         1       8 <NA>              1       7 <NA>
6   168 C          234         1       6 <NA>             NA      NA <NA>
7   225 E          306         1      13 <NA>              1      12 <NA>
```
Errors include:
- row 14: messy handwriting? (21 or 27)
- row 96: typo in flwr_2000
- row 121: discrepancy in plot ID
- row 159: messy handwriting? (33 or 38)
- row 167 & 168: completely different rows
- row 225: missing data in Eric’s version
```r
# after fixing data-entry errors, we get `data_resolved.xlsx`
plants <- read_excel("data/data_resolved.xlsx")
```

Explore data summaries
- You can’t check for errors if you don’t get to know your data!
- Use skimr::skim() to get a nicely formatted summary
- Look for the number of unique values for categorical variables
- Look for long tails or strange patterns in the mini-histograms for numeric variables
```r
skimr::skim(plants)
```

| Name | plants |
| Number of rows | 247 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| plot | 0 | 1.00 | 1 | 1 | 0 | 5 | 0 |
| flwr_2000 | 243 | 0.02 | 3 | 3 | 0 | 1 | 0 |
| flwr_2001 | 245 | 0.01 | 3 | 3 | 0 | 1 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| plant_id | 0 | 1.00 | 167.79 | 97.25 | 1.0 | 83.5 | 166 | 247.5 | 337 | ▇▇▇▇▇ |
| shts_2000 | 26 | 0.89 | 2.06 | 1.07 | 1.0 | 1.0 | 2 | 3.0 | 5 | ▇▇▃▂▁ |
| ht_2000 | 25 | 0.90 | 26.95 | 13.85 | 1.1 | 17.0 | 24 | 34.0 | 91 | ▆▇▃▁▁ |
| shts_2001 | 26 | 0.89 | 2.10 | 1.11 | 1.0 | 1.0 | 2 | 3.0 | 6 | ▇▂▁▁▁ |
| ht_2001 | 26 | 0.89 | 26.88 | 14.47 | 5.0 | 16.0 | 24 | 35.0 | 80 | ▇▇▃▁▁ |
Or get a more detailed breakdown by running skim() on a grouped data frame:
```r
plants |>
  group_by(plot) |>
  skim()
```

| Name | group_by(plants, plot) |
| Number of rows | 247 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 5 |
| ________________________ | |
| Group variables | plot |
Variable type: character
| skim_variable | plot | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|---|
| flwr_2000 | A | 54 | 0.02 | 3 | 3 | 0 | 1 | 0 |
| flwr_2000 | B | 63 | 0.02 | 3 | 3 | 0 | 1 | 0 |
| flwr_2000 | C | 65 | 0.02 | 3 | 3 | 0 | 1 | 0 |
| flwr_2000 | D | 35 | 0.03 | 3 | 3 | 0 | 1 | 0 |
| flwr_2000 | E | 26 | 0.00 | NA | NA | 0 | 0 | 0 |
| flwr_2001 | A | 55 | 0.00 | NA | NA | 0 | 0 | 0 |
| flwr_2001 | B | 64 | 0.00 | NA | NA | 0 | 0 | 0 |
| flwr_2001 | C | 66 | 0.00 | NA | NA | 0 | 0 | 0 |
| flwr_2001 | D | 34 | 0.06 | 3 | 3 | 0 | 1 | 0 |
| flwr_2001 | E | 26 | 0.00 | NA | NA | 0 | 0 | 0 |
Variable type: numeric
| skim_variable | plot | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| plant_id | A | 0 | 1.00 | 37.18 | 22.51 | 1.0 | 17.50 | 35.0 | 57.50 | 73 | ▇▆▆▆▇ |
| plant_id | B | 0 | 1.00 | 118.34 | 25.71 | 75.0 | 97.50 | 119.5 | 141.25 | 160 | ▆▇▆▆▇ |
| plant_id | C | 0 | 1.00 | 206.36 | 25.24 | 162.0 | 188.25 | 208.5 | 227.75 | 246 | ▇▅▇▇▇ |
| plant_id | D | 0 | 1.00 | 274.17 | 13.41 | 249.0 | 263.75 | 275.5 | 285.25 | 299 | ▅▇▇▇▅ |
| plant_id | E | 0 | 1.00 | 320.54 | 11.07 | 304.0 | 310.25 | 319.5 | 330.75 | 337 | ▇▅▅▅▇ |
| shts_2000 | A | 4 | 0.93 | 2.08 | 1.02 | 1.0 | 1.00 | 2.0 | 2.50 | 5 | ▆▇▂▂▁ |
| shts_2000 | B | 8 | 0.88 | 1.93 | 0.85 | 1.0 | 1.00 | 2.0 | 2.00 | 4 | ▆▇▁▃▁ |
| shts_2000 | C | 8 | 0.88 | 2.28 | 1.28 | 1.0 | 1.00 | 2.0 | 3.00 | 5 | ▇▇▃▂▂ |
| shts_2000 | D | 4 | 0.89 | 1.69 | 1.03 | 1.0 | 1.00 | 1.0 | 2.00 | 5 | ▇▅▁▁▁ |
| shts_2000 | E | 2 | 0.92 | 2.29 | 1.00 | 1.0 | 1.75 | 2.0 | 3.00 | 4 | ▆▇▁▇▃ |
| ht_2000 | A | 3 | 0.95 | 28.56 | 15.24 | 9.0 | 17.00 | 24.0 | 35.00 | 74 | ▇▇▃▁▂ |
| ht_2000 | B | 8 | 0.88 | 25.30 | 10.44 | 8.0 | 18.00 | 23.5 | 32.25 | 56 | ▅▇▃▂▁ |
| ht_2000 | C | 8 | 0.88 | 27.84 | 15.43 | 6.0 | 18.25 | 24.5 | 33.75 | 91 | ▇▇▂▁▁ |
| ht_2000 | D | 4 | 0.89 | 26.75 | 15.08 | 8.0 | 14.00 | 28.0 | 38.25 | 60 | ▇▆▃▂▃ |
| ht_2000 | E | 2 | 0.92 | 25.42 | 12.45 | 1.1 | 16.00 | 25.0 | 34.25 | 49 | ▂▇▅▅▃ |
| shts_2001 | A | 5 | 0.91 | 2.32 | 1.13 | 1.0 | 1.00 | 2.0 | 3.00 | 5 | ▇▇▇▂▂ |
| shts_2001 | B | 6 | 0.91 | 1.84 | 0.83 | 1.0 | 1.00 | 2.0 | 2.00 | 4 | ▇▇▁▃▁ |
| shts_2001 | C | 9 | 0.86 | 2.30 | 1.32 | 1.0 | 1.00 | 2.0 | 3.00 | 6 | ▇▂▂▁▁ |
| shts_2001 | D | 4 | 0.89 | 1.78 | 1.01 | 1.0 | 1.00 | 2.0 | 2.00 | 5 | ▇▇▁▁▁ |
| shts_2001 | E | 2 | 0.92 | 2.17 | 1.09 | 1.0 | 1.00 | 2.0 | 3.00 | 4 | ▇▇▁▃▃ |
| ht_2001 | A | 5 | 0.91 | 30.90 | 17.27 | 5.0 | 19.25 | 27.5 | 38.00 | 80 | ▆▇▃▁▁ |
| ht_2001 | B | 6 | 0.91 | 23.55 | 10.84 | 6.0 | 16.00 | 22.5 | 29.50 | 52 | ▅▇▅▃▁ |
| ht_2001 | C | 9 | 0.86 | 27.53 | 14.41 | 7.0 | 16.00 | 25.0 | 36.00 | 71 | ▇▆▅▂▁ |
| ht_2001 | D | 4 | 0.89 | 25.59 | 16.20 | 8.0 | 13.75 | 19.5 | 31.25 | 70 | ▇▃▁▂▁ |
| ht_2001 | E | 2 | 0.92 | 26.71 | 12.28 | 10.0 | 16.00 | 25.0 | 36.00 | 49 | ▇▆▂▃▅ |
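A lightweight complement to skim() for categorical columns is a plain tally, where a stray level (here a hypothetical lowercase plot code) stands out immediately:

```r
# Toy plot column; "b" is a deliberate typo for "B"
plots <- c("A", "A", "B", "b", "C")
table(plots)
```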
Explore data visually
- visdat::vis_guess() can help spot inconsistencies
- I’ll change one of the plots to a number to demonstrate

```r
# change plot in the 10th row to "1"
plants$plot[10] <- 1

# doesn't change the type of the column
class(plants$plot)
```

```
[1] "character"
```

```r
# but vis_guess() spots the mistake!
visdat::vis_guess(plants)
```

It also spots a decimal in the ht_2000 column (but it’s hard to see the green line)
Data validation pipelines with pointblank
```r
library(pointblank)
```

- pointblank provides 6 (six!) workflows for validating data
- The Data Quality Reporting Workflow (VALID-1) is probably the most useful for this group
- Start with a data frame, create an “agent”, tell it what to expect of your data with validation functions, and let it “interrogate” your data
- The output is an HTML table with buttons to download CSV files of any data that didn’t pass your validations
pointblank demo
- Decide on “action levels”: you can set a count or a fraction of test units as the threshold for a warning (WARN) or for stopping (STOP)

```r
al <- action_levels(warn_at = 1, stop_at = .02)
al
```

```
-- The `action_levels` settings
WARN failure threshold of 1 test units.
STOP failure threshold of 0.02 of all test units.
----
```
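To get a feel for the fraction threshold: with the 247 rows in plants, stop_at = .02 works out to roughly five failing test units. This is a back-of-envelope check, not pointblank’s internal arithmetic:

```r
# stop_at = .02 is a fraction of all test units; with 247 rows,
# about 5 failing units cross that threshold (5/247 is roughly 0.0202)
n_units <- 247
ceiling(0.02 * n_units)
```

One failing unit already triggers WARN (warn_at = 1), which is why several steps in the report below warn without stopping.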
- Create the agent

```r
agent <-
  create_agent(
    tbl = plants, # our data example from before
    label = "plants 2000 & 2001",
    actions = al
  )
```

Specify validation conditions
- Basic checks on column types with col_is_*() functions
- Check column values with col_vals_*() functions
- Check rows (e.g. duplicate rows) with rows_*() functions
```r
agent_informed <-
  agent |>
  col_is_character(columns = plot) |> # plot should be character
  col_is_numeric(columns = c(shts_2000, shts_2001)) |> # shts should be numeric
  col_vals_in_set(columns = plot, set = LETTERS[1:5]) |> # plot should be A-E
  col_vals_lt( # expect shts < 5
    columns = c(shts_2000, shts_2001),
    value = 5,
    na_pass = TRUE
  ) |>
  rows_distinct(columns = vars(plant_id)) # no duplicate plant IDs
#' TODO: ^this should work as just `columns = plant_id` but doesn't due to a bug.
#' Change if this gets fixed before the workshop.
#' https://github.com/rich-iannone/pointblank/issues/416
```

- Interrogate!
```r
agent_informed |> interrogate()
```

Pointblank Validation — plants 2000 & 2001 (tbl: plants; WARN at 1, STOP at 0.02, NOTIFY not set)

| STEP | VALIDATION | UNITS | PASS | FAIL | W | S |
|---|---|---|---|---|---|---|
| 1 | col_is_character() | 1 | 1 | 0 | ○ | ○ |
| 2 | col_is_numeric() | 1 | 1 | 0 | ○ | ○ |
| 3 | col_is_numeric() | 1 | 1 | 0 | ○ | ○ |
| 4 | col_vals_in_set() | 247 | 246 | 1 | ● | ○ |
| 5 | col_vals_lt() | 247 | 240 | 7 | ● | ● |
| 6 | col_vals_lt() | 247 | 239 | 8 | ● | ● |
| 7 | rows_distinct() | 247 | 245 | 2 | ● | ○ |

(● = action level triggered, ○ = not triggered)
Click the blue “CSV” buttons above to download a .csv file of just the rows that failed that particular validation
Flexible validations
If a validation function you need doesn’t exist, you can use col_vals_expr()
E.g. let’s add a validation that height is measured to the nearest cm.
```r
agent_informed <-
  agent_informed |>
  col_vals_expr(~ ht_2000 %% 1 == 0) |>
  col_vals_expr(~ ht_2001 %% 1 == 0)
```
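For reference, %% is R’s modulo operator, so x %% 1 isolates the fractional part — any height with digits after the decimal point fails the check:

```r
# x %% 1 is the fractional part of x
c(23, 17.5, 31) %% 1 == 0
#> [1]  TRUE FALSE  TRUE
```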
```r
agent_informed |> interrogate()
```

Pointblank Validation — plants 2000 & 2001 (tbl: plants; WARN at 1, STOP at 0.02, NOTIFY not set)

| STEP | VALIDATION | UNITS | PASS | FAIL | W | S |
|---|---|---|---|---|---|---|
| 1 | col_is_character() | 1 | 1 | 0 | ○ | ○ |
| 2 | col_is_numeric() | 1 | 1 | 0 | ○ | ○ |
| 3 | col_is_numeric() | 1 | 1 | 0 | ○ | ○ |
| 4 | col_vals_in_set() | 247 | 246 | 1 | ● | ○ |
| 5 | col_vals_lt() | 247 | 240 | 7 | ● | ● |
| 6 | col_vals_lt() | 247 | 239 | 8 | ● | ● |
| 7 | rows_distinct() | 247 | 245 | 2 | ● | ○ |
| 8 | col_vals_expr() | 222 | 221 | 1 | ● | ○ |
| 9 | col_vals_expr() | 221 | 221 | 0 | ○ | ○ |
Create new columns to test on the fly
“preconditions” let you manipulate the data within a single validation step, before the check is run.
E.g. check that height doesn’t change by more than 50 cm from 2000 to 2001
```r
agent_informed |>
  col_vals_lt(
    columns = ht_change, # doesn't exist yet
    value = 50,
    na_pass = TRUE,
    # creates a new column on the fly:
    preconditions = function(df) mutate(df, ht_change = ht_2001 - ht_2000)
  ) |>
  interrogate()
```

Pointblank Validation — plants 2000 & 2001 (tbl: plants; WARN at 1, STOP at 0.02, NOTIFY not set)

| STEP | VALIDATION | UNITS | PASS | FAIL | W | S |
|---|---|---|---|---|---|---|
| 1 | col_is_character() | 1 | 1 | 0 | ○ | ○ |
| 2 | col_is_numeric() | 1 | 1 | 0 | ○ | ○ |
| 3 | col_is_numeric() | 1 | 1 | 0 | ○ | ○ |
| 4 | col_vals_in_set() | 247 | 246 | 1 | ● | ○ |
| 5 | col_vals_lt() | 247 | 240 | 7 | ● | ● |
| 6 | col_vals_lt() | 247 | 239 | 8 | ● | ● |
| 7 | rows_distinct() | 247 | 245 | 2 | ● | ○ |
| 8 | col_vals_expr() | 222 | 221 | 1 | ● | ○ |
| 9 | col_vals_expr() | 221 | 221 | 0 | ○ | ○ |
| 10 | col_vals_lt() | 247 | 246 | 1 | ● | ○ |
Publishing validation reports
Students, faculty, and staff at University of Arizona have access to RStudio Connect, which allows you to publish an R Markdown document to the web with a single click.
Data validation can be automated in a variety of ways. If you are interested in more advanced applications of data validation for your lab, contact us!
Fixing mistakes
For true mistakes in data entry (paper -> spreadsheet), it’s probably OK to just edit the raw data
For other errors, best practice:
- Don’t edit raw data!
- Record all changes to raw data (e.g. by using an R script to make them)
- Flag observations that have been changed
- Publish raw data, cleaning steps/scripts, and “cleaned” data
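The pattern above can be sketched in a few lines. This is a toy example; the specific fix and file name are hypothetical:

```r
# Raw data stays untouched; every change is scripted and flagged
raw <- data.frame(plant_id = c(33, 142), ht_2000 = c(27, -56))

clean <- raw
# a negative height must be a transcription slip; correct and flag it
clean$ht_2000[clean$plant_id == 142] <- 56
clean$flag_edited <- clean$plant_id == 142

# publish the raw data, this script, and the cleaned data together
# write.csv(clean, "data/plants_clean.csv", row.names = FALSE)
```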
Keep an eye out for future workshops on data wrangling/tidying/cleaning
Help
Feel free to drop by the CCT Data Science Team office hours, which happen every Tuesday morning. We would love to help you with your R questions about data validation and more!
You can also make an appointment with Eric to discuss this content and get troubleshooting help.